52  Stationarity and Differencing - Practical

52.1 Introduction

We have examined some of the basic questions in time-series analysis: specifically, are there clear trends in our data over time, and does there appear to be a seasonal influence in the data?

However, there are some key aspects of time-series data that are important to explore before we begin to undertake any statistical analysis of the data.

Recall from the previous section:

  • Stationarity is a characteristic of time series data, one that we are usually hoping to observe.

  • Differencing is a technique that we can use to transform non-stationary data into stationary data.

In this section, we’ll explore both concepts in detail.

52.2 Examining stationarity 1: demonstration

Create a synthetic dataset

I’m going to start by creating a synthetic dataset that is stationary. (Note that no random seed is set here, so the exact values will differ each time the code is run.)

rm(list=ls()) # clear environment

# Load necessary libraries
library(forecast)
library(ggplot2)
library(tseries)

# Function to generate a stationary time series
generate_stationary_ts <- function(n) {
  # AR(1) process with a coefficient that ensures stationarity (absolute value < 1)
  arima.sim(model = list(ar = 0.5), n = n)
}

# Generate time series data
stationary_ts <- generate_stationary_ts(120)
rm(generate_stationary_ts)

Prepare the dataset

Now, I’m going to prepare that data for analysis. Notice that I have ‘told’ R that the data is in monthly format (12 observations per year) and that the start date is January 2010.

Rather than create a new object, I have simply replaced the existing object stationary_ts with a time-series object of the same name.

# Convert to ts object
stationary_ts <- ts(stationary_ts, frequency = 12, start = c(2010, 1)) # monthly data starting January 2010
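If you want to confirm that the conversion behaved as expected, base R’s start(), end() and frequency() functions report the time index of a ts object. This is just an optional check, not part of the original workflow.

# Optional check on the new ts object
start(stationary_ts)     # should report 2010 1
end(stationary_ts)       # with 120 monthly values this should report 2019 12
frequency(stationary_ts) # should report 12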

Visual inspection

Show code
# Plot the generated time series data
plot(stationary_ts, main = "Plot of Time Series Data (mean in red)", xlab = "Time (year)", ylab = "Value")

# add the mean value

# Calculate mean of the time series
mean_value <- mean(stationary_ts, na.rm = TRUE)  # na.rm = TRUE ensures NA values are ignored

# Add a horizontal line at the mean value
abline(h = mean_value, col = "red", lwd = 2)  # 'h' specifies a horizontal line at the mean value

Note that the plotted figure shows no clear trend. The series seems to ‘hover’ around the mean, which suggests that the data is stationary.

Testing for stationarity

The Augmented Dickey-Fuller (ADF) test is often used to test for stationarity. It tests the null hypothesis that a unit root is present in a time series sample.

library(urca)

# Perform the Augmented Dickey-Fuller Test
adf_test <- ur.df(stationary_ts, type = "drift", lags = 1)

# View the summary of the test, including critical values
summary(adf_test)

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.17411 -0.65466 -0.04125  0.69066  2.52076 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02649    0.09238   0.287    0.775    
z.lag.1     -0.66179    0.10430  -6.345 4.54e-09 ***
z.diff.lag   0.06826    0.09370   0.728    0.468    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.003 on 115 degrees of freedom
Multiple R-squared:  0.3096,    Adjusted R-squared:  0.2976 
F-statistic: 25.79 on 2 and 115 DF,  p-value: 5.6e-10


Value of test-statistic is: -6.3453 20.1393 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.46 -2.88 -2.57
phi1  6.52  4.63  3.81

What does this output tell us?

  • If the test statistic is less (more negative) than the critical value at your chosen significance level (e.g., -3.45 < -2.88 at the 5% level), and/or if the reported p-value is small (typically < 0.05), you can reject the null hypothesis. This suggests that the time series is stationary.

  • If the test statistic is greater (less negative) than the critical value at the chosen significance level (e.g., -2.50 > -2.88 at the 5% level), and/or if the reported p-value is large (typically > 0.05), you cannot reject the null hypothesis. This suggests that the time series is non-stationary.

In the case above, the first test statistic reported (tau2 = -6.3453) is well below the 5% critical value (-2.88), and indeed below the 1% value (-3.46). We can reject the null hypothesis of a unit root and conclude that our data is stationary. (The second statistic, phi1, relates to a joint test involving the drift term and is not needed for our purposes here.)
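As a cross-check, the adf.test() function from the tseries package (loaded earlier in this demonstration) runs an Augmented Dickey-Fuller test and reports a p-value directly. This is just an alternative sketch; its default lag choice differs from the ur.df() call above, so the exact numbers will not match.

# Alternative ADF test that reports a p-value directly
library(tseries)
adf.test(stationary_ts) # a small p-value (< 0.05) points towards stationarity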

52.3 Examining stationarity 2: demonstration

Create a synthetic dataset

I’m going to create another synthetic dataset that has a clear trend (and in-built seasonality).

Show code to create synthetic dataset
rm(list=ls())

# Load necessary library
library(forecast)

# Set seed for reproducibility
set.seed(1234)

# Generate Time Series Data
# Creating a time series with trend and seasonality
time_points <- 120 # e.g., 120 months (10 years of monthly data)
trend_component <- seq(1, time_points, by = 1)
seasonal_component <- sin(seq(1, time_points, by = 1) * 2 * pi / 12) * 5
random_noise <- rnorm(time_points, mean = 0, sd = 2)

# Combine components to create final time series data
ts_data <- trend_component + seasonal_component + random_noise
rm(random_noise, seasonal_component, time_points, trend_component)

Prepare the dataset

Now, I’m going to prepare that data for analysis

# Convert to ts object
ts_data <- ts(ts_data, frequency = 12, start = c(2010, 1)) # monthly data starting January 2010

Visual inspection

# Plot time series data
plot(ts_data, main = "Plot of Time Series Data", xlab = "Time (year)", ylab = "Value")

Note that the plotted figure shows a clear upward trend in the data over time. Also, note that the data seems to rise, fall, rise more, fall, and so on, hinting at the in-built seasonality.

The appearance of a trend in our data raises suspicions that the data is non-stationary.

Testing for stationarity

Visual inspection is always useful, but statistical tests are usually preferable.

As you know, the Augmented Dickey-Fuller (ADF) test is commonly used to test for stationarity. It tests the null hypothesis that a unit root is present in a time series sample.

Show code
library(urca)

# Perform the Augmented Dickey-Fuller Test
adf_test <- ur.df(ts_data, type = "drift", lags = 1)

# View the summary of the test, including critical values
summary(adf_test)

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
   Min     1Q Median     3Q    Max 
-8.277 -2.077 -0.078  2.209  6.635 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)  1.065902   0.587177   1.815   0.0721 .
z.lag.1     -0.001868   0.008417  -0.222   0.8247  
z.diff.lag   0.004755   0.092427   0.051   0.9591  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.143 on 115 degrees of freedom
Multiple R-squared:  0.0004434, Adjusted R-squared:  -0.01694 
F-statistic: 0.02551 on 2 and 115 DF,  p-value: 0.9748


Value of test-statistic is: -0.2219 4.9703 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.46 -2.88 -2.57
phi1  6.52  4.63  3.81

What does this output tell us?

  • If the test statistic is less (more negative) than the critical value at your chosen significance level (e.g., -3.45 < -2.88 at the 5% level), and/or if the reported p-value is small (typically < 0.05), you can reject the null hypothesis. This suggests that the time series is stationary.

  • If the test statistic is greater (less negative) than the critical value at the chosen significance level (e.g., -2.50 > -2.88 at the 5% level), and/or if the reported p-value is large (typically > 0.05), you cannot reject the null hypothesis. This suggests that the time series is non-stationary.

In the case above, the test statistic (-0.2219) is well above the 5% critical value (-2.88), so we cannot reject the null hypothesis and must conclude that our data is non-stationary. This confirms what we suspected based on visual inspection.

52.4 Differencing the time series: demonstration

So we have a problem with this second data set. It is non-stationary. We can’t proceed with further statistical analysis of time-series data until this is addressed!

  • Differencing is a method of transforming a non-stationary time series into a stationary one. This is done by subtracting the previous observation from the current observation (a small worked example follows this list).

  • Differencing can help stabilise the mean of a time series by removing changes in the level of a time series, thus eliminating (or reducing) trend and seasonality.
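Before applying this to our data, here is a tiny worked example (with made-up numbers) of what diff() does: each value is the current observation minus the previous one, so the differenced series is one observation shorter than the original.

# Toy example: first differences of a short made-up vector
x <- c(3, 5, 9, 14, 14)
diff(x, differences = 1) # returns 2 4 5 0 (each value minus the one before it)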

First differencing

By applying first-order differencing to the object ts_data, we achieve a visual pattern that suggests stationarity in the data.

diff_ts_data <- diff(ts_data, differences = 1) # create new object with differencing

# Plot the differenced data

plot(diff_ts_data, main="First Differenced Data", xlab="Time", ylab="Differenced Value")

Visually, this looks more like what we would expect if our data was stationary. We can check this statistically:

library(urca)

# Perform the Augmented Dickey-Fuller Test
adf_test <- ur.df(diff_ts_data, type = "drift", lags = 1)

# View the summary of the test, including critical values
summary(adf_test)

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
   Min     1Q Median     3Q    Max 
-8.201 -2.323 -0.004  2.184  6.425 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.81837    0.31456   2.602   0.0105 *  
z.lag.1     -0.87587    0.13056  -6.708 7.97e-10 ***
z.diff.lag  -0.13026    0.09206  -1.415   0.1598    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.123 on 114 degrees of freedom
Multiple R-squared:  0.5125,    Adjusted R-squared:  0.5039 
F-statistic: 59.92 on 2 and 114 DF,  p-value: < 2.2e-16


Value of test-statistic is: -6.7084 22.5034 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.46 -2.88 -2.57
phi1  6.52  4.63  3.81

The ADF test confirms we’ve achieved our goal: the test statistic (-6.7084) is well below the 5% critical value (-2.88), so we can reject the null hypothesis of a unit root for the differenced data.
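As an optional cross-check (not part of the core workflow), the ndiffs() function from the forecast package, which was loaded at the start of this demonstration, estimates how many orders of differencing a series appears to need before it is stationary.

# Optional cross-check with forecast::ndiffs()
library(forecast)
ndiffs(ts_data)       # we would expect this to suggest 1 for the trending series
ndiffs(diff_ts_data)  # and 0 for the already-differenced series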

Additional note: seasonal differencing

If the time-series has a seasonal pattern, then seasonal differencing can be used. Here, the observation from the previous season is subtracted from the current observation.

Show code
s <- 12 # Example for monthly data with an annual cycle 
seasonal_diff_ts_data <- diff(ts_data, lag = s, differences = 1)

# Plot the seasonally differenced data

plot(seasonal_diff_ts_data, main="Seasonally Differenced Data", xlab="Time", ylab="Seasonally Differenced Value")
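If a series shows both a trend and a seasonal pattern, the two kinds of differencing can be combined. The sketch below (an illustration, not a step we need for this dataset) applies a seasonal difference at lag 12 and then a first difference:

# Sketch: seasonal difference (lag 12) followed by a first difference
double_diff_ts_data <- diff(diff(ts_data, lag = 12), differences = 1)
plot(double_diff_ts_data, main = "Seasonally and First Differenced Data",
     xlab = "Time", ylab = "Differenced Value")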

52.5 Stationarity and differencing: practice

Download the dataset here:

df <- read.csv('https://www.dropbox.com/scl/fi/sw8e1k7lknsg4tcq1pgtc/ns_ts.csv?rlkey=jqtaokuw76uct9gcajzcumt3j&dl=1')
  • Step One: Remove the Time column from the dataframe.

  • Step Two: Convert the dataframe df to a time-series object. Use frequency = 12.

  • Step Three: Plot the time-series object.

    • Consider: does the time-series appear to be stationary or non-stationary?

    • Consider: does the time-series appear to have seasonality or not?

  • Step Four: Test the time-series object for stationarity using the ADF test.

  • Step Five: If required (based on the ADF), apply differencing to the time-series to address issues of stationarity.

  • Step Six: Test new time-series data for stationarity using the ADF.

  • Step Seven: If stationarity has been achieved, decompose the new time-series data.

    • Consider: Is there a seasonal component to the data?
Show solution
# remove time column

df$Time <- NULL

# convert dataframe to time-series

df_ts <- ts(df, frequency = 12)

# plot df_ts

plot(df_ts, type = "o", col = "blue", main = "Time Series", xlab = "Time", ylab = "Value")

Show solution
# test for stationarity

library(urca)

# Perform the Augmented Dickey-Fuller Test
adf_test <- ur.df(df_ts, type = "drift", lags = 1)

# View the summary of the test, including critical values
summary(adf_test)

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.2588  -4.5697   0.7262   4.4189  21.1632 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  2.45705    1.43932   1.707   0.0911 .
z.lag.1     -0.05862    0.04004  -1.464   0.1464  
z.diff.lag  -0.11745    0.10277  -1.143   0.2560  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.41 on 95 degrees of freedom
Multiple R-squared:  0.04412,   Adjusted R-squared:  0.02399 
F-statistic: 2.192 on 2 and 95 DF,  p-value: 0.1173


Value of test-statistic is: -1.4642 1.4572 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.51 -2.89 -2.58
phi1  6.70  4.71  3.86
Show solution
# if needed, apply differencing and plot 
diff_df_ts <- diff(df_ts, differences = 1) # create new object with differencing

# Plot the differenced data

plot(diff_df_ts, main="First Differenced Data", xlab="Time", ylab="Differenced Value")

Show solution
# check new data with ADF

adf_test <- ur.df(diff_df_ts, type = "drift", lags = 1)

# View the summary of the test, including critical values
summary(adf_test)

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.3992  -5.2393   0.5585   5.0858  20.5987 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.59128    0.76237   0.776    0.440    
z.lag.1     -1.21538    0.15581  -7.800 8.36e-12 ***
z.diff.lag   0.04696    0.10249   0.458    0.648    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.435 on 94 degrees of freedom
Multiple R-squared:  0.5851,    Adjusted R-squared:  0.5763 
F-statistic: 66.29 on 2 and 94 DF,  p-value: < 2.2e-16


Value of test-statistic is: -7.8003 30.4705 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.51 -2.89 -2.58
phi1  6.70  4.71  3.86
Show solution
# if stationarity has been achieved, decompose the time-series


# Seasonal and Trend Decomposition
decomposed_data <- decompose(diff_df_ts)
plot(decomposed_data)
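To answer the final ‘Consider’ question, remember that the object returned by decompose() stores the estimated seasonal, trend and random components, so the seasonal part can be inspected directly. A short sketch:

# Inspect the estimated seasonal component directly
head(decomposed_data$seasonal, 12) # one full cycle of monthly seasonal effects
plot(decomposed_data$seasonal, main = "Seasonal Component", ylab = "Seasonal effect")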